machine vision


FaSDiff: Balancing Perception and Semantics in Face Compression via Stable Diffusion Priors

Zhou, Yimin, Xia, Yichong, Chen, Bin, Hong, Mingyao, Li, Jiawei, Wang, Zhi, Wang, Yaowei

arXiv.org Artificial Intelligence

With the increasing deployment of facial image data across a wide range of applications, efficient compression tailored to facial semantics has become critical for both storage and transmission. While recent learning-based face image compression methods have achieved promising results, they often suffer from degraded reconstruction quality at low bit rates. Directly applying diffusion-based generative priors to this task leads to suboptimal performance in downstream machine vision tasks, primarily due to poor preservation of high-frequency details. In this work, we propose FaSDiff (\textbf{Fa}cial Image Compression with a \textbf{S}table \textbf{Diff}usion Prior), a novel diffusion-driven compression framework designed to enhance both visual fidelity and semantic consistency. FaSDiff incorporates a high-frequency-sensitive compressor to capture fine-grained details and generate robust visual prompts for guiding the diffusion model. To address low-frequency degradation, we further introduce a hybrid low-frequency enhancement module that disentangles and preserves semantic structures, enabling stable modulation of the diffusion prior during reconstruction. By jointly optimizing perceptual quality and semantic preservation, FaSDiff effectively balances human visual fidelity and machine vision accuracy. Extensive experiments demonstrate that FaSDiff outperforms state-of-the-art methods in both perceptual metrics and downstream task performance.


Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill

Kurrey, Vaibhav, Pujari, Sivakalyan, Gupta, Gagan Raj

arXiv.org Artificial Intelligence

We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.


Detecting Concept Drift in Neural Networks Using Chi-squared Goodness of Fit Testing

Ayers, Jacob Glenn, Ramanan, Buvaneswari A., Khan, Manzoor A.

arXiv.org Artificial Intelligence

As the adoption of deep learning models has grown beyond human capacity for verification, meta-algorithms are needed to ensure reliable model inference. Concept drift detection, a field dedicated to identifying statistical shifts in data, is underutilized in the monitoring of neural networks, which may encounter inference data whose distributional characteristics diverge from their training data. Given the wide variety of model architectures, applications, and datasets, it is important that concept drift detection algorithms be adaptable to different inference scenarios. In this paper, we introduce an application of the $\chi^2$ Goodness of Fit Hypothesis Test as a drift detection meta-algorithm applied to a multilayer perceptron, a convolutional neural network, and a transformer trained for machine vision as they are exposed to simulated drift during inference. To that end, we demonstrate how unexpected drops in accuracy due to concept drift can be detected without directly examining the inference outputs. Our approach enhances safety by ensuring models are continually evaluated for reliability across varying conditions.
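The test the paper applies can be illustrated with a minimal sketch (the window size, four-class setup, and df = 3 critical value here are illustrative assumptions, not the authors' configuration): compare the class histogram of a window of recent predictions against the training-time class proportions using the chi-squared goodness-of-fit statistic, so drift is flagged without ever looking at ground-truth labels.

```python
import numpy as np

def chi2_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())

def drift_detected(train_class_freq, window_preds, num_classes, critical_value):
    """Flag drift when recent prediction counts diverge from training frequencies."""
    observed = np.bincount(np.asarray(window_preds), minlength=num_classes)
    expected = np.asarray(train_class_freq) * len(window_preds)
    return chi2_statistic(observed, expected) > critical_value

# Training data was balanced over 4 classes; the chi-squared critical
# value for df = 3 at alpha = 0.01 is about 11.34.
uniform = [0.25] * 4
in_dist = np.tile(np.arange(4), 100)   # perfectly balanced prediction window
drifted = np.zeros(400, dtype=int)     # model suddenly predicts a single class

print(drift_detected(uniform, in_dist, 4, 11.34))   # False
print(drift_detected(uniform, drifted, 4, 11.34))   # True
```

Because only the model's own outputs are binned, the same monitor wraps any classifier, which is what makes it usable as a meta-algorithm across the MLP, CNN, and transformer cases studied in the paper.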


The Good Robot podcast: Machine vision with Jill Walker Rettberg

AIHub

Hosted by Eleanor Drage and Kerry McInerney, The Good Robot is a podcast which explores the many complex intersections between gender, feminism and technology. In this episode, we talked to Jill Walker Rettberg, Professor of Digital Culture at the University of Bergen in Norway. In this wide-ranging conversation, we talk about machine vision's origins in polished volcanic glass, whether or not we'll actually have self-driving cars, and that famous Photoshopped Mother's Day photo released by Kate Middleton in March 2024. Jill Walker Rettberg is Professor of Digital Culture and Co-Director of the Center for Digital Narrative (CDN), a Norwegian Center of Research Excellence that has received a 15 million grant from the Norwegian Research Council (2023-2033). She is also Principal Investigator of the ERC project Machine Vision in Everyday Life: Playful Interactions with Visual Technologies in Digital Art, Games, Narratives and Social Media (2018-2024), and of the ERC Advanced grant project AI Stories: Narrative Archetypes for Artificial Intelligence (2024-2029).


High Efficiency Image Compression for Large Visual-Language Models

Li, Binzhe, Wang, Shurun, Wang, Shiqi, Ye, Yan

arXiv.org Artificial Intelligence

In recent years, large visual-language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we propose, for the first time, a variable-bitrate image compression framework consisting of a pre-editing module and an end-to-end codec that achieves promising rate-accuracy performance across different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed around their representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which confers enhanced generalization capability across varied data and tasks. Experimental results demonstrate that the proposed framework achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments on multi-modal tasks reveal the robustness and generalization capability of the proposed framework. Large visual-language models (LVLMs) have shown impressive success in a variety of multi-modal application domains. Images, which typically carry a high data volume, are usually compressed for transmission before being fed to the LVLMs at the cloud end. Instead of supporting only a single task, LVLMs typically support multiple tasks simultaneously, which brings unprecedented challenges to image coding for machines [1].
In the past decades, as the default visual data communication solutions, image and video standards such as H.264/AVC [2], H.265/HEVC [3], H.266/VVC [4], and AVS [5] have been developed to improve rate-distortion (RD) performance. Inspired by the rapid development of deep neural networks, many learning-based image and video codecs have been proposed [6]-[10], achieving comparable and even better RD performance than VVC [11], [12].
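As a hedged illustration of the token-level idea (the function names, the cosine-distance choice, and the R + lambda*D weighting are our assumptions for the sketch, not the authors' implementation), a rate-accuracy objective can be written as the usual rate term plus a distortion measured on the large model's semantic tokens rather than on pixels:

```python
import numpy as np

def token_distortion(t_ref, t_rec):
    """Mean cosine distance between matching semantic token embeddings."""
    t_ref = t_ref / np.linalg.norm(t_ref, axis=1, keepdims=True)
    t_rec = t_rec / np.linalg.norm(t_rec, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t_ref * t_rec, axis=1)))

def rate_accuracy_loss(bits_per_pixel, t_ref, t_rec, lam=0.1):
    """Classic R + lambda*D trade-off, with D measured on semantic tokens."""
    return bits_per_pixel + lam * token_distortion(t_ref, t_rec)

# 16 token embeddings of dimension 64, standing in for an LVLM's tokens
tokens = np.random.default_rng(1).normal(size=(16, 64))
print(token_distortion(tokens, tokens) < 1e-9)   # True: perfect reconstruction
```

Optimizing the codec against a token-space distortion like this, instead of a single downstream task head, is what gives the framework its task-agnostic flavor: any task the LVLM performs from those tokens benefits.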


Low Cost Machine Vision for Insect Classification

Brandt, Danja, Tschaikner, Martin, Chiaburu, Teodor, Schmidt, Henning, Schrimpf, Ilona, Stadel, Alexandra, Beckers, Ingeborg E., Haußer, Frank

arXiv.org Artificial Intelligence

Preserving the number and diversity of insects is one of our society's most important goals in the area of environmental sustainability. A prerequisite for this is systematic, scaled-up monitoring in order to detect correlations and identify countermeasures. Automated monitoring using live traps is therefore important, but so far no system provides image data detailed enough for entomological classification. In this work, we present an imaging method as part of a multisensor system developed as a low-cost, scalable, open-source system that is adaptable to classical trap types. The image quality meets the requirements for classification in the taxonomic tree: illumination and resolution have been optimized and motion artefacts suppressed. The system is evaluated on an exemplary dataset of 16 insect species of the same as well as different genus, family and order. We demonstrate that standard CNN architectures like ResNet50 (pretrained on iNaturalist data) or MobileNet perform very well on the prediction task after re-training, and smaller custom-made CNNs also yield promising results; classification accuracy of $>96\%$ has been achieved. Moreover, we show that image cropping of insects is necessary for classifying species with high inter-class similarity.


Intelligent Robotic Control System Based on Computer Vision Technology

Che, Chang, Zheng, Haotian, Huang, Zengyi, Jiang, Wei, Liu, Bo

arXiv.org Artificial Intelligence

Computer vision is a simulation of biological vision using computers and related equipment, and an important part of the field of artificial intelligence. Its research goal is to give computers the ability to recognize three-dimensional environmental information from two-dimensional images. It builds on image processing, signal processing, probability and statistics, computational geometry, neural networks, machine learning theory and computer information processing to analyze and interpret visual information. The article explores the intersection of computer vision technology and robotic control, highlighting its importance in fields such as industrial automation, healthcare, and environmental protection. Computer vision, which simulates human visual observation, plays a crucial role in enabling robots to perceive and understand their surroundings, leading to advances in tasks like autonomous navigation, object recognition, and waste management. By integrating computer vision with robot control, robots gain the ability to interact intelligently with their environment, improving efficiency, quality, and environmental sustainability.


Probabilistic Multimodal Depth Estimation Based on Camera-LiDAR Sensor Fusion

Obando-Ceron, Johan S., Romero-Cano, Victor, Monteiro, Sildomar

arXiv.org Artificial Intelligence

Multi-modal depth estimation is one of the key challenges for endowing autonomous machines with robust robotic perception capabilities. There have been outstanding advances in the development of uni-modal depth estimation techniques based on either monocular cameras, because of their rich resolution, or LiDAR sensors, due to the precise geometric data they provide. However, each of these suffers from some inherent drawbacks, such as high sensitivity to changes in illumination conditions in the case of cameras and limited resolution for the LiDARs. Sensor fusion can be used to combine the merits and compensate for the downsides of these two kinds of sensors. Nevertheless, current fusion methods work at a high level. They process the sensor data streams independently and combine the high-level estimates obtained for each sensor. In this paper, we tackle the problem at a low level, fusing the raw sensor streams, thus obtaining depth estimates which are both dense and precise, and can be used as a unified multi-modal data source for higher level estimation problems. This work proposes a Conditional Random Field model with multiple geometry and appearance potentials. It seamlessly represents the problem of estimating dense depth maps from camera and LiDAR data. The model can be optimized efficiently using the Conjugate Gradient Squared algorithm. The proposed method was evaluated and compared with the state-of-the-art using the commonly used KITTI benchmark dataset.
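The linear algebra at the core of such a model can be sketched in one dimension (the weights, sizes, and sampling pattern below are illustrative assumptions): sparse LiDAR returns act as unary potentials, a smoothness term stands in for the geometry/appearance potentials, and the resulting normal equations are solved with SciPy's Conjugate Gradient Squared routine, the solver named in the paper.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cgs

# Minimize  sum_i w_i (d_i - z_i)^2 + lam * sum_i (d_{i+1} - d_i)^2
# over a 1-D "scanline": w_i > 0 only where LiDAR returned a depth, and
# the pairwise term is a stand-in for the CRF smoothness/appearance prior.
n = 50
z = np.linspace(2.0, 5.0, n)            # depths hinted by LiDAR
w = np.zeros(n); w[::5] = 1.0           # LiDAR is sparse: every 5th pixel
lam = 2.0                               # smoothness weight

# Setting the gradient to zero gives (W + lam * L) d = W z,
# where L is the 1-D graph Laplacian (tridiagonal here).
main = w + lam * np.r_[1.0, 2.0 * np.ones(n - 2), 1.0]
off = -lam * np.ones(n - 1)
A = diags([off, main, off], [-1, 0, 1], format="csr")
b = w * z

d, info = cgs(A, b)                      # info == 0 signals convergence
```

In the paper the same kind of normal equations arise over the full image graph with learned geometry and appearance potentials; the 1-D chain only keeps the sketch small enough to read, while the output `d` is already a dense depth profile interpolated between the sparse LiDAR hits.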


Machine Vision Using Cellphone Camera: A Comparison of deep networks for classifying three challenging denominations of Indian Coins

Joshi, Keyur D., Shah, Dhruv, Shah, Varshil, Gandhi, Nilay, Shah, Sanket J., Shah, Sanket B.

arXiv.org Artificial Intelligence

Indian currency coins come in a variety of denominations. Of all the varieties, the Rs.1, Rs.2, and Rs.5 coins have similar diameters. The majority of coin styles in market circulation for the Rs.1 and Rs.2 denominations are nearly identical except for the numerals on the reverse side. If a coin is resting on its obverse side, the correct denomination is not distinguishable by humans. We therefore hypothesized that a digital image of a coin resting on either side could be classified into its correct denomination by training a deep neural network model. The digital images were captured with cheap cell phone cameras. To find the most suitable deep neural network architecture, four were selected for comparison based on a preliminary analysis. The results confirm that two of the four deep neural network models can classify the correct denomination from either side of a coin with an accuracy of 97%.


Robotics Engineer - Machine Vision - AI Jobs

#artificialintelligence

RFA Engineering (www.rfamec.com) is seeking talent in the field of Robotics, Perception, Vision Processing and Machine Learning to architect, develop and integrate new intelligent features into the next generation of agricultural and off-highway equipment. From this mid-western location, you could work as part of a global team of world-class engineers and researchers that are leading the implementation of autonomous and semi-autonomous technology in their industry. You will be working with a high-velocity team of multi-disciplined engineers, developers and architects that are developing new applications using state-of-the-art technologies including 3D vision systems, machine learning, sensor fusion technology, FPGAs and GPUs. This is an excellent growth opportunity for anyone interested in these emerging technologies. Our primary focus is product development of off-highway equipment including agricultural, construction, mining, recreational, industrial, and special machines.